Experiments on using Yahoo! categories to describe documents
Abstract
We suggest that the name of one Yahoo! category, or a collection of such names (or those of any other WWW indexer), can be used to describe the content of a document. Such categories offer a standardized and universal way of referring to or describing the nature of real-world objects, activities, documents, and so on, and may be used, we suggest, to semantically characterize the content of documents. WWW indices such as Yahoo! provide a huge hierarchy of categories (topics) that touch every aspect of human endeavor. Such topics can be used as descriptors in the way that librarians use, for example, the Library of Congress cataloging system to annotate and categorize books. In the course of investigating this idea, we address the problem of automatically categorizing webpages in the Yahoo! directory. We use Telltale as our classifier; Telltale uses n-grams to compute the similarity between documents. We experiment with various types of descriptions for the Yahoo! categories and for the webpages to be categorized. Our findings suggest that the best results occur when using the very brief descriptions of the entries categorized in Yahoo!; these brief descriptions, which are part of the Yahoo! index itself, accompany most entries. We discuss further research and ways to improve the performance of our method.
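As a rough illustration of classifying a page by n-gram similarity, the sketch below compares character n-gram counts using cosine similarity and picks the closest category description. It is not Telltale's implementation: the choice of character trigrams, the cosine measure, the function names (`char_ngrams`, `cosine`, `classify`), and the sample category descriptions are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(page_text: str, category_descriptions: dict, n: int = 3):
    """Return the best-matching category name and the score for every category."""
    page_profile = char_ngrams(page_text, n)
    scores = {
        name: cosine(page_profile, char_ngrams(description, n))
        for name, description in category_descriptions.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical category descriptions, standing in for the brief
    # entry descriptions mentioned in the abstract.
    categories = {
        "Recreation/Sports": "sports scores teams leagues basketball football",
        "Computers/Internet": "software programming internet web browsers",
    }
    best, scores = classify("Latest basketball scores and league standings", categories)
    print(best, scores)
```

In this sketch each category is represented by a single short description; swapping in longer text, such as the full content of the categorized pages, only changes the strings passed to `classify`.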
Similar resources
Web Search Using Automatic Classification
We study the automatic classification of Web documents into pre-specified categories, with the objective of increasing the precision of Web search. We describe experiments in which we classify documents into high-level categories of the Yahoo! taxonomy, and a simple search architecture and implementation using this classification. The validation of our classification experiments offers interest...
Integrating Multiple Internet Directories by Instance-based Learning
Finding desired information on the Internet is becoming increasingly difficult. Internet directories such as Yahoo!, which organize web pages into hierarchical categories, provide one solution to this problem; however, such directories are of limited use because some bias is applied both in the collection and categorization of pages. We propose a method for integrating multiple Internet directo...
Analyzing Users’ Health Information Needs Based on the Yahoo Answers®
Background and Aim: People refer to virtual information resources for answering their medical questions. One of these resources includes question and answering (Q&A) sites in medicine. This study aims to analyze health questions posted on the Yahoo Answers to identify health information needs, the motivations for asking questions, evaluation of information user satisfaction resulted from recei...
Categorisation by Context
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Beyond its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...